Semantic extraction from images is a pressing problem with applications in many
semantic retrieval systems. In this paper, a semantic-based image retrieval
(SBIR) system is proposed that combines the growth partitioning tree (GP-Tree),
built in the authors' previous work, with a self-organizing map (SOM) network
and a neighbor graph (the combination is called SgGP-Tree) to improve accuracy.
For each query image, a set of similar images is retrieved on the SgGP-Tree,
and a set of visual words is extracted from the classes obtained by mask
region-based convolutional neural networks (R-CNN). The visual words are the
basis for querying the semantics of the input image on an ontology with a
simple protocol and resource description framework query language (SPARQL)
query. Experiments were performed on the ImageCLEF and MS-COCO image datasets,
with precision values of 0.898453 and 0.875467, respectively. These results are
compared with related works on the same image datasets, showing the
effectiveness of the proposed method.
1. INTRODUCTION

In recent years, many research groups have worked to improve the efficiency of semantic-based
image retrieval (SBIR). Some approaches rely on a built ontology [1]-[5]; some build image retrieval systems
that use natural language analysis to generate a simple protocol and resource description framework query
language (SPARQL) query, which searches an image set described in the resource description framework
(RDF) [6]-[10]; some propose image retrieval systems based on relevance feedback techniques [11]; and some
apply ontology to text queries and multimedia data, or determine the relationships between images through
image annotations and features [9], [12]-[17]. However, the set of similar images obtained still does not fully
meet user needs because of the difference between computational representations in machines and
natural language. This work aims to reduce this semantic gap in order to improve the performance of image retrieval.

The published works show that image retrieval attracts considerable research interest.
Furthermore, applying a hierarchical clustering tree to semantic-based similar image retrieval is a
viable but challenging approach. Building on existing works and addressing the limitations
of related published methods [18]-[20], a semantic image retrieval system that combines graph-GPTree and a
self-organizing map (SIR-SgGP) is built. After the query image is classified by mask region-based
convolutional neural networks (R-CNN), a SPARQL statement is generated to query the semantics and extract
the uniform resource identifiers (URIs) of the images on the proposed ontology structure.
Our ontology-based model supports two main functions for semantic retrieval on image
datasets: 1) retrieving a set of images similar to a given image and 2) mapping low-level features to the
high-level semantics of images based on an ontology. A hierarchical clustering tree, called GP-Tree, was
published in [20] to automatically store images indexed by their low-level features. GP-Tree is a
multi-branch tree that clusters feature vectors, so it can store large amounts of data and retrieve images
quickly. However, retrieval on the GP-Tree follows only the branch with the highest similarity
to the query image, so similar images stored in other branches are missed. The query accuracy is therefore
limited, and the retrieval efficiency on the GP-Tree needs to be improved.

This paper proposes a model that combines GP-Tree with a self-organizing map (SOM) network and a
neighbor graph to limit the omission of similar elements caused by branching, and thereby to improve the
accuracy of image retrieval. The input image is segmented to determine the classes of its objects using the
mask R-CNN network. For each image segment, low-level features are extracted to form a combined feature
descriptor. These feature sets are used to search the GP-Tree for a set of similar images. A SPARQL query is
generated from the visual words obtained from the set of classes produced by the mask R-CNN network and is
executed on the ontology to extract the semantics of the query image. Experiments are run on the ImageCLEF
and MS-COCO datasets, and the results are compared with published results in related works to evaluate the
effectiveness of the proposed method. The contributions of the article are:
1) a model combining GP-Tree with a SOM network and a neighbor graph (SgGP-Tree) to improve
image retrieval efficiency; and 2) a semantic-based image retrieval model that combines machine
learning (SgGP-Tree) and ontology. The rest of the paper is organized as follows: section 2 presents the
semantic image query method, which is the main contribution of the paper; sections 3 and 4 present the
experimental results on the datasets and their assessment; and the final section gives some conclusions.


2. RESEARCH METHOD 
2.1. Image segmentation and classification of objects in the image 

In this paper, a pre-trained mask R-CNN model is used to detect the objects in an image and, from
them, determine the classes of the input image. Figure 1 depicts the results of object recognition and
classification on the MS-COCO dataset by mask R-CNN based on ResNet-101-FPN [21]. For each extracted
image segment, low-level features (color, texture, shape) are combined into a feature descriptor [22].
The image retrieval system in this paper uses an 81-dimensional low-level feature vector.


Figure 1. Mask R-CNN results using ResNet-101-FPN on images in the MS-COCO dataset 


2.2. Description of GP-Tree 

GP-Tree [20] consists of a root, a set of nodes $T$, and a set of leaves $L$. Nodes are connected through
parent-child paths. The leaves $L$, which are nodes without child nodes, contain data elements $n$ so that
$L = \{n_i \mid i = 1..M\}$, where $M$ is the maximum number of elements in a leaf. A data element
$n = (f, id, c)$ contains the feature vector of an image $f$, the identifier of the image $id$, and the
classes of the image $c$. The nodes $T$ have at least two child nodes and contain center elements $u$ so
that $T = \{u_k \mid k = 1..N\}$, where $N$ is the number of elements in a node. Each element of a node
is linked to its child node through the link $l$ of that element. The representative element
$u = (f_c, l, \phi)$ contains the following components: $f_c$, the center of the feature vectors at a child
node; $l$, the path linking to that child node; and $\phi$, the value used to check whether the next
sub-cluster must be a leaf or not.
2.3. Neighbor cluster graph
Image retrieval on GP-Tree does not achieve high performance when nodes are split many times:
when a leaf is split, similar elements can be separated into different branches. The graph-GPTree
neighbor graph is built on the set of leaves of the GP-Tree. The new leaves created when a leaf is split
are marked as neighbors according to different criteria, which links related leaves together and avoids
missing data during retrieval, thereby improving the retrieval of similar images. Based on
this analysis, the neighbor cluster graph is defined as:
a. Definition 1. The graph-GPTree neighbor cluster graph
Graph-GPTree $G = (V, E)$ is an undirected graph, including:
- The set of vertices $V$, which are the clusters of leaves of the GP-Tree;
- The set of edges $E \subseteq V \times V$, which are the links between pairs of leaves, formed
according to the neighbor relationship.
b. Definition 2. Neighbor cluster graph
- Level-1 neighbors: let $f_{cent_i}$, $f_{cent_j}$ be the center vectors of the leaves $L_i$, $L_j$,
respectively, where

$f_{cent_i} = \mathrm{average}\{f_t \mid f_t \in L_i,\ t = 1..|L_i|\}$, $f_{cent_j} = \mathrm{average}\{f_t \mid f_t \in L_j,\ t = 1..|L_j|\}$.

If $d_E(f_{cent_i}, f_{cent_j}) < \theta$, then $L_i$, $L_j$ are level-1 neighbors, where $d_E$ is the
Euclidean distance and $\theta$ is a given threshold value.
- Level-2 neighbors: let $m$, $n$ be the number of classes of the images appearing in two leaves
$L_i$, $L_k$, respectively, and let $c_i$, $c_k$ be the classes that occur most often in those two leaves, where

$c_i = \arg\max_{c_j}\{\mathrm{count}(n_t.c_j) \mid n_t \in L_i,\ t = 1..|L_i|,\ j = 1..m\}$,
$c_k = \arg\max_{c_j}\{\mathrm{count}(n_t.c_j) \mid n_t \in L_k,\ t = 1..|L_k|,\ j = 1..n\}$.

If $c_i = c_k$, then $L_i$, $L_k$ are level-2 neighbors.
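
The two neighbor criteria of Definition 2 can be checked as in the sketch below. It is an illustration under simplifying assumptions: a leaf is represented as a list of `(feature_vector, class_label)` pairs with a single class per element, and the function names are hypothetical.

```python
from collections import Counter
from math import dist


def center(vectors):
    """Component-wise average of a non-empty list of feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def is_level1_neighbor(leaf_i, leaf_j, theta):
    """Level 1: the Euclidean distance between the leaf centers is below theta."""
    c_i = center([f for f, _ in leaf_i])
    c_j = center([f for f, _ in leaf_j])
    return dist(c_i, c_j) < theta


def dominant_class(leaf):
    """Most frequent class label among the elements of a leaf."""
    return Counter(c for _, c in leaf).most_common(1)[0][0]


def is_level2_neighbor(leaf_i, leaf_k):
    """Level 2: both leaves share the same most frequent class."""
    return dominant_class(leaf_i) == dominant_class(leaf_k)
```

Level 1 links leaves that are close in feature space, while level 2 links leaves that are semantically alike even if their centers are far apart.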
Algorithm 1 describes how a leaf is split and how graph-GPTree is created:


Algorithm 1. Neighbor graph: split a leaf on GP-Tree and create graph-GPTree

Input: threshold θ, leaf node L, Graph-GPTree
Output: Graph-GPTree
Function neighborGraph(θ, L, Graph-GPTree)
Begin
    // Find the two furthest elements in the leaf node
    center = average{L.ED_i.f, i = 1..M};
    left = argmax{Euclidean(center, L.ED_i.f), i = 1..M};
    right = argmax{Euclidean(L.ED_left.f, L.ED_i.f), i = 1..M};
    EDLeft = L.ED_left; EDRight = L.ED_right;
    // Create two new leaf nodes
    L_l = L_l ∪ {EDLeft}; L_r = L_r ∪ {EDRight};
    // Allocate the elements to the two new leaf nodes
    Foreach ed in L do
        If (Euclidean(ed.f, EDLeft.f) < Euclidean(ed.f, EDRight.f)) then L_l = L_l ∪ {ed};
        Else L_r = L_r ∪ {ed}; EndIf
    EndForeach
    // Create the center elements for the two nodes L_l and L_r
    ECLeft = average{L_l.ED_i.f, i = 1..|L_l|};
    ECRight = average{L_r.ED_i.f, i = 1..|L_r|};
    // Update the representative elements in the parent
    L.Parent = L.Parent ∪ {ECLeft, ECRight};
    // Determine the level-1 neighbors of the two newly split leaf nodes
    If (Euclidean(ECLeft, ECRight) < θ) then
        Neighbor[1].L_l = Neighbor[1].L_l ∪ {L_r};
        Neighbor[1].L_r = Neighbor[1].L_r ∪ {L_l};
    EndIf
    // Determine the level-2 neighbors of the two newly split leaf nodes
    LeftClass = argmax{count(L_l.ED_i.cla), i = 1..|L_l|};
    RightClass = argmax{count(L_r.ED_i.cla), i = 1..|L_r|};
    If (LeftClass = RightClass) then
        Neighbor[2].L_l = Neighbor[2].L_l ∪ {L_r};
        Neighbor[2].L_r = Neighbor[2].L_r ∪ {L_l};
    EndIf
    Graph-GPTree = Graph-GPTree ∪ {Neighbor[1], Neighbor[2]};
    Return Graph-GPTree;
End
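
The splitting step of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading, not the authors' code: a leaf is assumed to be a list of `(feature_vector, class_label)` pairs containing at least two distinct feature vectors, the parent update is left out, and the function returns the two neighbor flags instead of updating a graph object.

```python
from collections import Counter
from math import dist


def mean(vectors):
    """Component-wise average of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def split_leaf(leaf, theta):
    """Split a leaf and report whether the halves are level-1/level-2 neighbors.

    Returns (left, right, ec_left, ec_right, is_level1, is_level2).
    """
    c = mean([f for f, _ in leaf])
    # Seeds: the element furthest from the center, then the one furthest from it.
    left_seed = max(leaf, key=lambda ed: dist(c, ed[0]))
    right_seed = max(leaf, key=lambda ed: dist(left_seed[0], ed[0]))
    left, right = [], []
    for ed in leaf:
        # Each element joins the half whose seed is closer.
        if dist(ed[0], left_seed[0]) < dist(ed[0], right_seed[0]):
            left.append(ed)
        else:
            right.append(ed)
    ec_left = mean([f for f, _ in left])
    ec_right = mean([f for f, _ in right])
    is_level1 = dist(ec_left, ec_right) < theta

    def top(part):
        return Counter(cl for _, cl in part).most_common(1)[0][0]

    is_level2 = top(left) == top(right)
    return left, right, ec_left, ec_right, is_level1, is_level2
```

In a full implementation the two flags would add edges between the new leaves in the level-1 and level-2 neighbor sets of graph-GPTree.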
2.4. GP-Tree-graph-SOM

The graph-GPTree neighbor cluster graph solves most of the problems of GP-Tree and improves
image retrieval performance. However, clusters on the graph are selected by a distance measure, which can
lead to measurement errors when the tree splits nodes many times: when a leaf is split, the two new leaves
may not become neighbors even though the representative elements of the split leaf (the most frequent ones)
have been distributed across both new leaf nodes. Additional criteria are therefore needed to select the
winning leaf based on the weight of the representative element of the leaf. For this reason, a SOM network
is built on the GP-Tree and graph-GPTree, called SgGP-Tree, to form a model that combines tree, graph, and SOM.

The SgGP-Tree network consists of two layers: input and output. The input layer is the set of image
feature vectors $f = (f_1, f_2, \ldots, f_m)$, where each vector $f_i$ has $n$ dimensions,
$f_i = (v_1, v_2, \ldots, v_n)$. The output layer consists of the neurons containing the leaf set of the
GP-Tree. The number of neurons is the number of representative classes of the leaf set of the GP-Tree, and
the neurons are labeled according to the classes; the leaves of the GP-Tree are allocated to the neurons
based on the representative classes of the leaves. The input layer is fully connected to the output layer by
the set of weight vectors $W = (W_1, W_2, \ldots, W_p)$, where $p$ is the number of neurons of the output
layer and $W_j = (w_1, w_2, \ldots, w_n)$. Figure 2 depicts the structure of the SgGP-Tree network based on
GP-Tree and the graph-GPTree graph. On the basis of the SOM network, the SgGP-Tree network structure is
defined as:

a. Definition 3. The SgGP-Tree network

The SgGP-Tree network is a SOM network whose input is the set of feature vectors of an image
$f = (f_1, f_2, \ldots, f_m)$, where each vector $f_i$ has $n$ dimensions, $f_i = (v_1, v_2, \ldots, v_n)$,
$f_i \in \{0,1\}$, and whose output layer consists of the neurons containing the leaf set of the GP-Tree.
The input and output layers are fully connected by the weight vectors $W_j = (w_1, w_2, \ldots, w_n)$,
$w_i \in \{0,1\}$.

The purpose of the SgGP-Tree network is to classify the input data. Training the SgGP-Tree means
training its weights: adjusting the weights lets the network meet the classification requirements as well
as possible. However, weight adjustment takes a long time when the input dataset is large and the weights
are initialized randomly with poor values. Therefore, instead of random weights, a set of weight vectors
trained on the GP-Tree is used. This weight vector is defined as:

b. Definition 4. Weight vector

Let $w$ be the weight vector of the data elements at a leaf. The weight vector $w$ is the center of the
feature vectors of the most frequently occurring classes in the leaf:

$$w = \frac{1}{n} \sum_{i=1}^{n} f_i \qquad (1)$$

where $f_i$ is the feature vector value of the $n$ classes that appear the most. The SgGP-Tree training
process includes the following steps:
- Step 1: allocate the leaves from GP-Tree to the SgGP-Tree network.
- Step 2: initialize the weights $w_j$ from the set of weights obtained during GP-Tree training.
- Step 3: randomly select a vector $f_i$ as the training sample.
- Step 4: find the winning neuron using the Euclidean distance:

$i(f_i) = \arg\min_j \| f_i - w_j \|$
- Step 5: update the weights of the neurons:

$w_j = w_j + \gamma(t)\,(f_i - w_j)$
where $\gamma(t)$ is the learning rate, which decays exponentially from an initial value $\gamma_0$, and
$t$ is the epoch.
- Step 6: repeat from step 3 until training is over.
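
The training steps can be sketched as below. Several points are assumptions made for illustration: equation (1) is read as the mean feature vector of the single most frequent class in a leaf, only the winning neuron is updated, and the learning rate is taken as $\gamma_0 \exp(-t/\tau)$ with a hypothetical decay constant `tau`, since the exact schedule is not specified here.

```python
import math
import random
from collections import Counter


def initial_weight(leaf):
    """Equation (1), read as the mean feature vector of the leaf's top class."""
    top = Counter(c for _, c in leaf).most_common(1)[0][0]
    vecs = [f for f, c in leaf if c == top]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def winner(weights, f):
    """Step 4: index of the neuron whose weight vector is nearest to f."""
    return min(range(len(weights)), key=lambda j: math.dist(f, weights[j]))


def train(weights, samples, epochs=10, gamma0=0.5, tau=10.0, seed=0):
    """Steps 3 to 6: refine the neuron weights with winner-only updates."""
    rng = random.Random(seed)
    weights = [list(w) for w in weights]        # leave the caller's data intact
    for t in range(epochs):
        gamma = gamma0 * math.exp(-t / tau)     # assumed decay schedule
        for _ in range(len(samples)):
            f = rng.choice(samples)             # step 3: random training sample
            j = winner(weights, f)              # step 4: winning neuron
            # Step 5: move the winner's weights toward the sample.
            weights[j] = [w + gamma * (x - w) for w, x in zip(weights[j], f)]
    return weights
```

Starting from weights derived from the GP-Tree leaves, rather than random values, places each neuron near its cluster before training begins, which is the motivation given for Definition 4.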


Figure 2. The SgGP-Tree combined network structure (input vectors $f_1, \ldots, f_m$ connected through the weights $w_i$ to the output neurons holding the leaves of GP-Tree)


A method for semantic-based image retrieval using hierarchical clustering tree and ... (Nguyen Minh Hai) 


ISSN: 1693-6930


2.5. Query image semantics on ontology 

To query images according to the semantic approach, an ontology framework for images is
built from the ImageCLEF image dataset [22]. SPARQL is a query language for data sources described
as RDF or web ontology language (OWL) triples. The input query image can contain one or many
objects, so mask R-CNN is used to extract the visual words vector, which contains one or more
semantic classes of the image. A SPARQL statement (AND/OR) is then generated automatically from this
vector and executed on the ontology to find the annotation of the query image [22]. The query result on the
ontology is a set of URIs and the metadata of the query image.
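
Generating an AND/OR SPARQL statement from a visual words vector could look like the sketch below. The namespace `http://example.org/image-ontology#`, the property `hasClass`, and the function `build_sparql` are hypothetical placeholders, not the vocabulary of the ontology in [22]; class names are assumed to be valid prefixed local names (no spaces).

```python
def build_sparql(visual_words, mode="AND",
                 prefix="http://example.org/image-ontology#",
                 prop="hasClass"):
    """Build a SELECT query for images annotated with the given classes."""
    if mode not in ("AND", "OR"):
        raise ValueError("mode must be 'AND' or 'OR'")
    if not visual_words:
        raise ValueError("at least one visual word is required")
    patterns = [f"?img io:{prop} io:{w} ." for w in visual_words]
    if mode == "AND":
        # All triple patterns must match the same image.
        body = "\n  ".join(patterns)
    else:
        # Any one of the patterns may match.
        body = "\n  UNION\n  ".join("{ " + p + " }" for p in patterns)
    return (f"PREFIX io: <{prefix}>\n"
            f"SELECT DISTINCT ?img WHERE {{\n  {body}\n}}")


if __name__ == "__main__":
    print(build_sparql(["dog", "person"], mode="OR"))
```

The AND form narrows the result to images containing every detected class, while the OR form widens it to images containing any of them.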


3. RESULTS AND DISCUSSION 
3.1. Datasets 

To demonstrate the effectiveness of the proposed method, two popular image datasets are used for
testing: ImageCLEF and MS-COCO. The ImageCLEF dataset consists of 20,000 images and 276 class
labels. Each image represents a single object and is annotated with a number of semantically relevant text
tags. The MS-COCO dataset consists of 123,000 images and 80 class labels. Each image contains many objects,
and there is a caption for each object in the image.


3.2. The proposed SBIR system 
The semantic-based image retrieval system based on SgGP-Tree and ontology is called SIR-SgGP.
The query system consists of two phases, pre-processing and image retrieval. Figure 3 depicts the
architecture of SIR-SgGP with these two phases:
a. Pre-processing phase
1) The input image is segmented to determine the classes of the objects in the image and, at the same time,
the low-level features of the objects are extracted. From these, data samples are created that represent
the image as a set of feature vectors and corresponding classes, and the dataset is allocated on GP-Tree.
2) A combined GP-Tree-graph-SOM model is created from the set of GP-Tree leaves.
3) A semi-automatic ontology framework is built from the WWW dataset and the image dataset.
b. Image retrieval phase
1) A representative data sample is created for the query image, consisting of feature vectors and
corresponding classes.
2) Retrieval is first performed on SgGP-Tree to find the winning cluster; the neighbor graph is then used
to retrieve the set of images similar to the query image.
3) The visual words vector is created from the classes obtained for the query image; from it, a SPARQL
query is generated and executed on the built ontology framework. The result of the query is the annotation
of the query image.
4) The results of 2) and 3) are combined to give a set of similar images and the annotation of
the query image.
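
Step 2) of the retrieval phase can be illustrated with the sketch below. It is a simplification under stated assumptions: neurons, leaves, and neighbor links are plain dictionaries with hypothetical names, every leaf of the winning neuron is treated as the winning cluster, and the level-1 and level-2 neighbor sets are merged.

```python
from math import dist


def retrieve(query, weights, neuron_leaves, leaves, neighbors, k=5):
    """Return the ids of the k images closest to the query.

    weights:       {label: weight vector} of the output neurons.
    neuron_leaves: {label: [leaf ids]} leaves allocated to each neuron.
    leaves:        {leaf id: [(image id, feature vector), ...]}.
    neighbors:     {leaf id: set of neighboring leaf ids}.
    """
    # Winning neuron: the one whose weight vector is nearest to the query.
    win = min(weights, key=lambda label: dist(query, weights[label]))
    # Candidate leaves: the winner's leaves plus their graph neighbors.
    candidates = set(neuron_leaves[win])
    for leaf_id in neuron_leaves[win]:
        candidates |= neighbors.get(leaf_id, set())
    images = [item for leaf_id in candidates for item in leaves[leaf_id]]
    images.sort(key=lambda item: dist(query, item[1]))
    return [image_id for image_id, _ in images[:k]]
```

Expanding the candidate set through the neighbor graph is what recovers similar images that a split placed in a different leaf.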
Figure 3. The SIR-SgGP model


3.3. Application 

For each input image, the feature vectors and image classes are extracted using mask R-CNN. The
feature vectors are stored on the GP-Tree based on the Euclidean similarity measure. Given a query image,
the SIR-SgGP system extracts features and classifies the image with mask R-CNN, then searches the
SgGP-Tree combined network by content to find a set of similar images. From the classes obtained for the
query image, visual words are extracted; at the same time, a SPARQL query is generated automatically to
query the semantics of the input image on the ontology.
Figure 4(a) shows the classification of the objects in the image, the annotation query corresponding to
the objects, and the extraction of image features. Figure 4(b) shows the set of images similar to the query
image, extracted from the image features by the SIR-SgGP system.
Figure 4. A similar image retrieval result of the SIR-SgGP system: (a) the results of classification, feature
extraction, and annotation query for the 000000003827.jpg image of the MS-COCO dataset; and (b) the
set of images similar to the query image extracted by the SIR-SgGP system


4. EXPERIMENTAL EVALUATION 

To evaluate image search efficiency, the article uses precision, recall, F-measure, and query time
(milliseconds). The average performance values and average search times of SIR-SgGP on the ImageCLEF and
MS-COCO datasets are summarized in Table 1. To assess the accuracy and efficiency of the proposed image
retrieval system, the experimental results are compared with other studies on the same image datasets.
Table 2 and Table 3 show that the proposed method achieves a higher mean average precision than the
semantic-based image retrieval systems it is compared with on the same image dataset (0.898453 against at
most 0.655644 on ImageCLEF, and 0.875467 against at most 0.862848 on MS-COCO), which indicates that the
proposed method is effective.
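
For reference, the per-query metrics can be computed as in the sketch below, which uses the standard definitions of precision, recall, and F-measure (the harmonic mean of the two). The function name and the example image ids are hypothetical; the averages in Table 1 would be taken over all test queries.

```python
def precision_recall_f(retrieved, relevant):
    """Precision, recall, and F-measure for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)            # relevant images that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f


if __name__ == "__main__":
    # Four images retrieved, three relevant in the dataset, two in common.
    print(precision_recall_f(["a", "b", "c", "d"], ["a", "b", "e"]))
```

Because the F-measure is averaged per query, the average F-measure over a dataset need not equal the harmonic mean of the average precision and average recall.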


Table 1. Performance of the image retrieval system SIR-SgGP
Image dataset   Average precision   Average recall   Average F-measure   Average query time (ms)
ImageCLEF       0.898453            0.823434         0.813241            633.356
MS-COCO         0.875467            0.723454         0.783452            746.345

Table 2. Comparison of mean average precision of methods on ImageCLEF dataset
Method               Mean average precision (MAP)
D. Hu, 2019 [23]     0.643344
D. Wang, 2018 [24]   0.655644
SIR-SgGP             0.898453

Table 3. Comparison of mean average precision of methods on MS-COCO dataset
Method               MAP
Y. Cao, 2018 [25]    0.857645
Y. Xie, 2020 [26]    0.862848
SIR-SgGP             0.875467


5. CONCLUSION

In this paper, a semantic-based image retrieval method is proposed that combines GP-Tree, a
neighbor graph, and a SOM network (SgGP-Tree). For each input image, features and image classes are
extracted by mask R-CNN to create a visual word vector. A SPARQL query is then generated automatically
from the visual word vector and executed on the ontology to retrieve the similar image set and the
annotation of the query image. An image retrieval model based on SgGP-Tree and ontology (SIR-SgGP) is
proposed and evaluated on the ImageCLEF and MS-COCO datasets, with precision values of 0.898453 and
0.875467, respectively. The experimental results are compared with other studies on the same image
datasets, showing that the proposed method has higher accuracy. In future work, we will continue to improve
the image feature extraction methods to further improve the performance of similar image set retrieval.